Add support for native preemption retries #4342

jparraga-stackav · 2025-04-19T00:03:55Z

What type of PR is this?

New feature

What this PR does / why we need it:

This pull request adds support for native Armada preemption retry handling. Retry handling can be configured at the platform level as a default in the scheduling config, and per job with two annotations:

armadaproject.io/preemptionRetryCountMax (defaults to 0 for now, i.e. disabled if not configured)
armadaproject.io/preemptionRetryEnabled

The scheduling algorithm has been modified so that preempted jobs are not failed if they are eligible for a retry. An eligible job is instead marked to be requeued.
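To make the eligibility rule concrete, here is a minimal sketch of how the platform default and the two annotations could combine. It is not the code in this PR: `platformRetryConfig`, `shouldRequeue`, and `retriesUsed` are hypothetical names, and the rule that annotations override the platform default is an assumption.

```go
package main

import "strconv"

// Annotation keys named in the PR description.
const (
	retryCountMaxAnnotation = "armadaproject.io/preemptionRetryCountMax"
	retryEnabledAnnotation  = "armadaproject.io/preemptionRetryEnabled"
)

// platformRetryConfig is a hypothetical stand-in for the platform-level
// defaults in the scheduling config; the real field names may differ.
type platformRetryConfig struct {
	Enabled       bool
	RetryCountMax uint
}

// shouldRequeue reports whether a preempted job should be requeued instead of
// failed. Annotations that are present and parseable override the platform
// defaults (an assumption of this sketch); unparseable values are ignored.
func shouldRequeue(cfg platformRetryConfig, annotations map[string]string, retriesUsed uint) bool {
	enabled := cfg.Enabled
	if raw, ok := annotations[retryEnabledAnnotation]; ok {
		if v, err := strconv.ParseBool(raw); err == nil {
			enabled = v
		}
	}
	maxRetries := cfg.RetryCountMax
	if raw, ok := annotations[retryCountMaxAnnotation]; ok {
		if v, err := strconv.ParseUint(raw, 10, 32); err == nil {
			maxRetries = uint(v)
		}
	}
	// A maximum of 0 means no retry is ever granted, matching the
	// "defaults to 0, i.e. disabled" behaviour described above.
	return enabled && retriesUsed < maxRetries
}
```

With a maximum of 2, a job that has already been retried once would be requeued, and a job that has been retried twice would be failed as before.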

Unit tests are included. We've also tested this end-to-end in our development environment with both jobs and gangs, covering cases where a retry succeeds as well as cases where retries are exhausted.

Which issue(s) this PR fixes:

Fixes: #4340

Special notes for your reviewer:

dejanzele · 2025-04-23T13:48:52Z

internal/common/preemption/utils.go

+
+// AreRetriesEnabled determines whether preemption retries are enabled at the job level. Also returns whether the
+// annotation was set.
+func AreRetriesEnabled(annotations map[string]string) (bool, bool) {


I'd name the return params, so it is easier to see what is what
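For illustration, the signature with named results might look like the sketch below. Only the signature and doc comment come from the diff. The body, the `retryEnabledAnnotation` constant, and the result names `enabled` and `annotationSet` are assumptions.

```go
package main

import "strconv"

// retryEnabledAnnotation is the annotation key from the PR description.
const retryEnabledAnnotation = "armadaproject.io/preemptionRetryEnabled"

// AreRetriesEnabled determines whether preemption retries are enabled at the
// job level. Also returns whether the annotation was set.
// Naming the results makes it clear at a glance which bool is which.
func AreRetriesEnabled(annotations map[string]string) (enabled bool, annotationSet bool) {
	raw, ok := annotations[retryEnabledAnnotation]
	if !ok {
		return false, false
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		// Unparseable value: reported as set but disabled (a choice made for this sketch).
		return false, true
	}
	return v, true
}
```

Callers can then write `enabled, annotationSet := AreRetriesEnabled(annotations)` and the godoc shows the same names.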

d80tb7 · 2025-04-29T06:37:15Z

Hi,

Thanks for this, it looks great. One thing I think we should consider/discuss is whether we could add the preemption retry fields as first-class proto fields rather than relying on annotations. The reasoning here is that annotations are very easy to add at first, as they require no interface changes, but can soon get quite hard to work with, as everything is just a map[string]string.

Personally I'd be in favour of adding these fields to schedulerobjects.JobSchedulingInfo right now with the view of adding them to api.Job once the feature is stable.

One comment I

jparraga-stackav · 2025-04-30T04:55:21Z

Personally I'd be in favour of adding these fields to schedulerobjects.JobSchedulingInfo right now with the view of adding them to api.Job once the feature is stable.

I looked into this, but it seems a bit difficult to keep it as an annotation and not a first-class citizen.

I'm not able to create the scheduling info from an armadaevents.SubmitJob since the annotations aren't there. I think there would also need to be some more thinking about how the global preemption retry config integrates into this. Might need to move that into the scheduler ingester to more elegantly handle that unless we want to reference the global preemption retry config all the time.

Cinojose · 2025-05-09T03:19:05Z

@jparraga-stackav Thanks for the work on this.
I was wondering if we could also handle imminent node shutdown scenarios. In addition to the current retry logic for pod evictions, it might be useful to check for pods terminated with the reason:

Pod was terminated in response to imminent node shutdown.

We could incorporate a simple check on pod.Status.Message to capture these cases. This would help cover both graceful shutdowns and node preemptions for various reasons.
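A rough sketch of such a check is below. It is kept dependency-free by taking the message as a plain string, so the caller would pass `pod.Status.Message`. The function name `isNodeShutdownTermination` is hypothetical, and substring matching is an assumption about how strict the comparison should be.

```go
package main

import "strings"

// nodeShutdownMessage is the termination message quoted in this comment.
const nodeShutdownMessage = "Pod was terminated in response to imminent node shutdown."

// isNodeShutdownTermination reports whether a pod status message indicates
// that the pod was terminated because its node was shutting down.
func isNodeShutdownTermination(statusMessage string) bool {
	return strings.Contains(statusMessage, nodeShutdownMessage)
}
```

Matching on a message string is brittle if the wording ever changes, so this would probably need a test pinned to the Kubernetes versions in use.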

jparraga-stackav · 2025-05-22T18:24:41Z

@jparraga-stackav Thanks for the work on this. I was wondering if we could also handle imminent node shutdown scenarios. In addition to the current retry logic for pod evictions, it might be useful to check for pods terminated with the reason:

Pod was terminated in response to imminent node shutdown.

We could incorporate a simple check on pod.Status.Message to capture these cases. This would help cover both graceful shutdowns and node preemptions for various reasons.

I will likely look into this as a follow up improvement.

jparraga-stackav · 2025-05-29T18:51:47Z

internal/scheduler/database/migrations/022_add_job_run_index.sql

@@ -1 +1 @@
-ALTER TABLE runs ADD COLUMN IF NOT EXISTS run_index bigint NOT NULL DEFAULT 0;
+ALTER TABLE runs ADD COLUMN IF NOT EXISTS run_index bigint;


During testing we found that the old scheduler ingester was trying to write rows with a null value, which blocked ingestion. To make the migration smoother, we've made the column nullable and handle null values in the app layer.

jparraga-stackav marked this pull request as ready for review April 19, 2025 00:08

dejanzele reviewed Apr 23, 2025

jparraga-stackav force-pushed the preemption-retries branch from 7591f89 to 8ba4635 Compare April 23, 2025 18:06

robertdavidsmith previously approved these changes Apr 29, 2025

jparraga-stackav dismissed robertdavidsmith’s stale review via 3f819bc April 30, 2025 03:30

jparraga-stackav force-pushed the preemption-retries branch from 99a74c5 to 3f819bc Compare April 30, 2025 03:30

Add support for native preemption retries

287daaa

Signed-off-by: Jason Parraga <[email protected]>

jparraga-stackav force-pushed the preemption-retries branch from 5c813fa to 287daaa Compare May 22, 2025 18:20

jparraga-stackav force-pushed the preemption-retries branch 2 times, most recently from 75306bd to dbd0cef Compare May 28, 2025 00:27

Plumb run index to pod name

7cf9c85

Signed-off-by: Jason Parraga <[email protected]>

jparraga-stackav force-pushed the preemption-retries branch from dbd0cef to 7cf9c85 Compare May 28, 2025 00:31

Make migration smoother with nullable value

71c1009

Signed-off-by: Jason Parraga <[email protected]>

jparraga-stackav commented May 29, 2025
